Abstract. The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be computed with Bayes' rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $\mathcal{M}$, one can base one's prediction on the Bayes mix $\xi$ defined as a weighted sum of distributions $\nu\in\mathcal{M}$. Various convergence results of the mixture posterior $\xi_t$ to the true posterior $\mu_t$ are presented. In particular a new (elementary) derivation of the convergence $\xi_t/\mu_t\to 1$ is provided, which additionally gives the rate of convergence. A general sequence predictor is allowed to choose an action $y_t$ based on $x_1...x_{t-1}$ and receives loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. No assumptions are made on the structure of $\ell$ (apart from being bounded) and $\mathcal{M}$. The Bayes-optimal prediction scheme $\Lambda_\xi$ based on mixture $\xi$ and the Bayes-optimal informed prediction scheme $\Lambda_\mu$ are defined and the total loss $L_\xi$ of $\Lambda_\xi$ is bounded in terms of the total loss $L_\mu$ of $\Lambda_\mu$. It is shown that $L_\xi$ is bounded for bounded $L_\mu$ and $L_\xi/L_\mu\to 1$ for $L_\mu\to\infty$. Convergence of the instantaneous losses is also proven.

Keywords. Bayesian sequence prediction; general loss function and bounds; convergence; mixture distributions



1 Introduction 

Setup. We consider inductive inference problems in the following form: Given a string $x_1x_2...x_{t-1}$, we want to predict its continuation $x_t$. We assume that the strings which have to be continued are drawn from a probability distribution $\mu$. The maximal prior information a prediction algorithm can possess is the exact knowledge of $\mu$, but in many cases the true generating distribution is not known. In order to overcome this problem a mixture distribution $\xi$ is defined as a $w_\nu$-weighted sum over distributions $\nu\in\mathcal{M}$, where $\mathcal{M}$ is any discrete (hypothesis) set including $\mu$. We assume that $\mathcal{M}$ is known and contains the true distribution, i.e. $\mu\in\mathcal{M}$. Since the posterior $\xi_t$ can be shown to converge rapidly to the true posterior $\mu_t$, making decisions based on $\xi$ is often nearly as good as the infeasible optimal decision based on the unknown $\mu$ [MF98]. In this work we compare the expected loss of predictors based on mixture $\xi$ to the expected loss of informed predictors based on $\mu$.

Contents. Section 2 introduces concepts and notation needed later, including strings, probability distributions, mixture distributions, expectations, and various types of convergence and distance measures. Section 3 summarizes various convergence results of the mixture distribution $\xi$ to the true distribution $\mu$. We provide a new (elementary) derivation of the posterior convergence in ratio, which is not based on martingales, but on the Hellinger distance, and compare it to related known results [Doo53, LV97, Vov87, VL00a]. Section 4 introduces the decision theoretic setup, where an action/prediction $y_t$ results in a loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. Improving upon previous results in [MF98, Hut01a, Hut01b], the expected total (or cumulative) loss $L^{\Lambda_\xi}$ made by the Bayes-optimal prediction scheme based on mixture $\xi$ minus the expected total loss $L^{\Lambda_\mu}$ of the optimal informed prediction scheme based on $\mu$ is bounded by $O(\sqrt{L^{\Lambda_\mu}})$. Some popular loss functions, including the absolute, square, logarithmic, Hellinger, and error loss are discussed. A proof of the loss bound is given in Section 5. Convergence of the instantaneous losses is briefly studied in Section 6. Section 7 recapitulates the assumptions made in this work and possible relaxations, mentions some optimality properties of $\xi$ proven in [Hut02a], and provides an outlook to future work.



2 Preliminaries 

Strings and Probability Distributions. We denote strings over a finite alphabet $\mathcal{X}$ by $x_1x_2...x_n$ with $x_t\in\mathcal{X}$. We abbreviate $x_{n:m} := x_nx_{n+1}...x_{m-1}x_m$ and $x_{<n} := x_1...x_{n-1}$. We use Greek letters for probability distributions/measures, especially $\rho$ for arbitrary ones, $\mu\in\mathcal{M}$ for the true (generating) one, $\nu\in\mathcal{M}$ for arbitrary ones in $\mathcal{M}$, and $\xi$ for the mixture (1). Let $\rho(x_{1:t})$ be the probability that an (infinite) sequence starts with $x_1...x_t$. The conditional $\rho$ probability that a given string $x_1...x_{t-1}$ is continued by $x_t$ is $\rho_t := \rho(x_t|x_{<t}) = \rho(x_{1:t})/\rho(x_{<t})$. The considered prediction schemes will be based on these posteriors.

Mixture distributions. Let $\mathcal{M} := \{\nu_1,\nu_2,...\}$ be a finite or countable set of candidate probability distributions on strings. We define a weighted average on $\mathcal{M}$

$$\xi(x_{1:n}) := \sum_{\nu\in\mathcal{M}} w_\nu\cdot\nu(x_{1:n}), \qquad \sum_{\nu\in\mathcal{M}} w_\nu = 1, \qquad w_\nu > 0. \tag{1}$$

$\xi$ is called a Bayes-mixture. The weights $w_\nu$ may be interpreted as the prior belief in environment $\nu\in\mathcal{M}$. The most interesting property of the mixture distribution $\xi$ is that it multiplicatively dominates all distributions in $\mathcal{M}$:

$$\xi(x_{1:n}) \ge w_\nu\cdot\nu(x_{1:n}) \quad\text{for all}\quad \nu\in\mathcal{M}. \tag{2}$$

In the following, we assume that $\mathcal{M}$ is known and contains the true distribution, i.e. $\mu\in\mathcal{M}$. If $\mathcal{M}$ is chosen sufficiently large, then $\mu\in\mathcal{M}$ is not a serious constraint. Generic classes, especially where $\mathcal{M}$ contains all (semi)computable probability distributions, are discussed in [Sol78, LV97, Hut01a, Hut02a]. Generalizations to the case where $\mathcal{M}$ does not contain $\mu$ are briefly discussed in [Hut02a] and more intensively in a related context in [Grü98].
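As an illustration of definitions (1) and (2), the following minimal sketch builds a Bayes mixture over a small class of i.i.d. Bernoulli sources. The class, the weights, and all function names are assumptions chosen for this example only; they are not part of the original text.

```python
from itertools import product

# Hypothetical hypothesis class: i.i.d. Bernoulli sources with P(x_t = 1) = theta.
THETAS = [0.2, 0.5, 0.8]
WEIGHTS = [0.25, 0.5, 0.25]  # prior weights w_nu, positive and summing to 1


def nu(theta, x):
    """Probability that a sequence starts with the bit string x under Bernoulli(theta)."""
    p = 1.0
    for bit in x:
        p *= theta if bit == 1 else 1.0 - theta
    return p


def xi(x):
    """Bayes mixture (1): weighted sum of the hypotheses' probabilities of x."""
    return sum(w * nu(th, x) for w, th in zip(WEIGHTS, THETAS))


def xi_posterior(x_next, history):
    """Mixture posterior xi(x_t | x_<t) = xi(x_1:t) / xi(x_<t)."""
    return xi(tuple(history) + (x_next,)) / xi(tuple(history))


def dominance_holds(n):
    """Check the dominance property (2) for every binary string of length n."""
    return all(
        xi(x) >= w * nu(th, x)
        for x in product((0, 1), repeat=n)
        for w, th in zip(WEIGHTS, THETAS)
    )
```

Calling `dominance_holds(6)` checks (2) on all 64 strings of length 6, and `xi_posterior(1, (1, 1))` gives the mixture's predictive probability of a further 1 after observing two 1s.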

Expectations and convergence measures. We use $E[..]$ to denote expectations w.r.t. the "true" distribution $\mu$ and abbreviate $E_t[..] := E[..|x_{<t}]$. If $[..]$ depends on $x_{1:t}$ only, i.e. is independent of $x_{t+1:\infty}$, we have

$$E[..] = \sum\nolimits'_{x_{1:t}} \mu(x_{1:t})[..] \quad\text{and}\quad E_t[..] := \sum\nolimits'_{x_t} \mu(x_t|x_{<t})[..],$$

where $\sum'$ sums over all $x_t$ or $x_{1:t}$ for which $\mu(x_{1:t})\neq 0$. Similarly we use $P[..]$ to denote the $\mu$ probability of event $[..]$. We need the following kinds of convergence of a random sequence $z_1,z_2,...$ to (a random variable) $z_*$:

with probability 1 (w.p.1): $P[z_t\to z_*] = 1$

in probability (i.p.): $\forall\varepsilon>0: P[|z_t-z_*|>\varepsilon]\to 0$ for $t\to\infty$

in mean sum (i.m.s.): $\sum_{t=1}^\infty E[(z_t-z_*)^2] < \infty$

in the mean (i.m.): $E[(z_t-z_*)^2]\to 0$ for $t\to\infty$

Convergence in one sense may imply convergence in another sense. The following implications are valid, strict, and complete:

i.m.s. $\Rightarrow$ w.p.1 $\Rightarrow$ i.p. and i.m.s. $\Rightarrow$ i.m. $\Rightarrow$ i.p.

Convergence i.m.s. is very strong: it provides a rate of convergence in the sense that the expected number of times $t$ in which $z_t$ deviates more than $\varepsilon$ from $z_*$ is finite and bounded by $\sum_{t=1}^\infty E[(z_t-z_*)^2]/\varepsilon^2$.
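The rate statement for i.m.s. convergence can be made explicit with Chebyshev's inequality. The short derivation below is only a sketch; the counting variable $N_\varepsilon$ and the indicator notation are introduced here for illustration and do not appear in the original.

```latex
\begin{align*}
N_\varepsilon &:= \sum_{t=1}^{\infty} \mathbb{1}\{|z_t - z_*| > \varepsilon\}
  && \text{(number of $\varepsilon$-deviations)} \\
E[N_\varepsilon] &= \sum_{t=1}^{\infty} P\big[|z_t - z_*| > \varepsilon\big]
  && \text{(monotone convergence)} \\
 &\le \sum_{t=1}^{\infty} \frac{E[(z_t - z_*)^2]}{\varepsilon^2} \;<\; \infty
  && \text{(Chebyshev, then i.m.s.)}
\end{align*}
```

Since $E[N_\varepsilon]$ is finite, $N_\varepsilon$ is finite with probability 1 for every $\varepsilon>0$, which is the implication from i.m.s. to w.p.1 convergence.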



Distance Measures. We need several distance measures between probability distributions $y_i\ge 0$, $z_i\ge 0$, $\sum_i y_i = \sum_i z_i = 1$, $i\in\{1,...,N\}$, namely the

absolute distance: $a = \sum_i |y_i - z_i|$

square or Euclidean distance: $s = \sum_i (y_i - z_i)^2$

Hellinger distance: $h = \sum_i (\sqrt{y_i} - \sqrt{z_i})^2$

relative entropy or KL divergence: $d = \sum_i y_i \ln\frac{y_i}{z_i}$

absolute divergence: $b = \sum_i y_i\,|\ln\frac{y_i}{z_i}|$ (3)

All bounds we prove in this work heavily rely on the following inequalities:

$$s \le d, \qquad h \le d, \qquad b-d \le a \le \sqrt{2d}. \tag{4}$$

See [Hut01a], [CT91, Lem.12.6.1], and [BM98, p178] for proofs of $s\le d$, $a\le\sqrt{2d}$, and $h\le d$, respectively. $b-d\le a$ is elementary and follows from $-\ln x \le \frac{1}{x}-1$. Inequality $s\le d$ is a generalization of the binary $N=2$ case used in [Sol78, Hut01c, LV97]. If we insert

$$\mathcal{X} = \{1,...,N\}, \qquad N = |\mathcal{X}|, \qquad i = x_t, \tag{5}$$

$$y_i = \mu_t := \mu(x_t|x_{<t}), \qquad z_i = \xi_t := \xi(x_t|x_{<t}) \tag{6}$$

into (3) we get various instantaneous distances (at time $t$) between $\mu$ and $\xi$. If we take the expectation (over $x_{<t}$) and sum over $t=1..n$, ($\sum_{t=1}^n E[...]$) we get various total distances between $\mu$ and $\xi$:



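$$\begin{array}{ll}
a_t(x_{<t}) := \sum_{x_t} |\mu_t - \xi_t|, & A_n := \sum_{t=1}^n E[a_t] \\
s_t(x_{<t}) := \sum_{x_t} (\mu_t - \xi_t)^2, & S_n := \sum_{t=1}^n E[s_t] \\
h_t(x_{<t}) := \sum_{x_t} (\sqrt{\mu_t} - \sqrt{\xi_t})^2, & H_n := \sum_{t=1}^n E[h_t] \\
d_t(x_{<t}) := \sum_{x_t} \mu_t \ln\frac{\mu_t}{\xi_t}, & D_n := \sum_{t=1}^n E[d_t] \\
b_t(x_{<t}) := \sum_{x_t} \mu_t\,|\ln\frac{\mu_t}{\xi_t}|, & B_n := \sum_{t=1}^n E[b_t]
\end{array} \tag{7}$$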
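The five distances in (3) and the inequalities (4) are easy to evaluate numerically. The following sketch is an illustration under the assumption that both arguments are probability vectors of equal length with strictly positive $z_i$; the function names are hypothetical.

```python
import math


def distances(y, z):
    """Return (a, s, h, d, b) from (3) for two probability vectors y and z."""
    a = sum(abs(yi - zi) for yi, zi in zip(y, z))
    s = sum((yi - zi) ** 2 for yi, zi in zip(y, z))
    h = sum((math.sqrt(yi) - math.sqrt(zi)) ** 2 for yi, zi in zip(y, z))
    # Terms with y_i = 0 contribute 0 (convention 0 * ln 0 = 0); z_i > 0 is assumed.
    d = sum(yi * math.log(yi / zi) for yi, zi in zip(y, z) if yi > 0)
    b = sum(yi * abs(math.log(yi / zi)) for yi, zi in zip(y, z) if yi > 0)
    return a, s, h, d, b


def check_inequalities(y, z, tol=1e-12):
    """Check s <= d, h <= d and b - d <= a <= sqrt(2 d) from (4)."""
    a, s, h, d, b = distances(y, z)
    return (s <= d + tol and h <= d + tol
            and b - d <= a + tol
            and a <= math.sqrt(2.0 * max(d, 0.0)) + tol)
```

For example, `check_inequalities([0.5, 0.3, 0.2], [0.2, 0.5, 0.3])` evaluates all four inequalities of (4) for one pair of distributions over three symbols.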



3 Convergence of $\xi$ to $\mu$

For $D_n$ the following representation and bound is well known and crucial [Sol78, LV97, Hut01a]:

$$D_n = \sum_{t=1}^n E[d_t(x_{<t})] = E\Big[\ln\frac{\mu(x_{1:n})}{\xi(x_{1:n})}\Big] \le \ln w_\mu^{-1} < \infty. \tag{8}$$

The inequality follows from (2). The following theorem summarizes various bounds and convergence results needed later. The major new part is Theorem 1(iv), which allows for an elementary proof of $\xi_t/\mu_t\to 1$ w.p.1 based on the Hellinger distance.
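Bound (8) can be checked exactly on a toy problem by enumerating all strings of a given length. The sketch below assumes a hypothetical class of three i.i.d. Bernoulli sources with the true source among them; none of these names or numbers come from the original text.

```python
import math
from itertools import product

THETAS = [0.2, 0.5, 0.8]      # hypothetical Bernoulli hypotheses
WEIGHTS = [0.25, 0.5, 0.25]   # prior weights w_nu
TRUE_INDEX = 2                # mu = Bernoulli(0.8), so w_mu = 0.25


def prob(theta, x):
    """Probability of the bit string x under an i.i.d. Bernoulli(theta) source."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def total_relative_entropy(n):
    """D_n = E[ln(mu(x_1:n) / xi(x_1:n))], by exact enumeration of all 2^n strings."""
    mu_theta = THETAS[TRUE_INDEX]
    total = 0.0
    for x in product((0, 1), repeat=n):
        mu_x = prob(mu_theta, x)
        xi_x = sum(w * prob(th, x) for w, th in zip(WEIGHTS, THETAS))
        total += mu_x * math.log(mu_x / xi_x)
    return total
```

In this setting `total_relative_entropy(n)` should be nondecreasing in `n` (each $d_t$ is non-negative) and should stay below `math.log(1 / WEIGHTS[TRUE_INDEX])`, i.e. $\ln w_\mu^{-1} = \ln 4$.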

Theorem 1 (Convergence of $\xi$ to $\mu$) Let there be sequences $x_1x_2...$ over a finite alphabet $\mathcal{X}$ drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols. The mixture conditional probability $\xi'_t := \xi(x'_t|x_{<t})$ of the next symbol $x'_t$ given $x_{<t}$ is related to the true conditional probability $\mu'_t := \mu(x'_t|x_{<t})$ in the following way:

i) $\sum_{t=1}^n E\big[\sum_{x'_t}(\mu'_t-\xi'_t)^2\big] \equiv S_n \le D_n \le \ln w_\mu^{-1} < \infty$

ii) $\sum_{x'_t}(\mu'_t-\xi'_t)^2 \equiv s_t(x_{<t}) \le d_t(x_{<t}) \to 0$ for $t\to\infty$ w.p.1

iii) $\xi'_t - \mu'_t \to 0$ for $t\to\infty$ w.p.1 (and i.m.s.) for any $x'_t$

iv) $\sum_{t=1}^n E\big[\big(\sqrt{\xi_t/\mu_t}-1\big)^2\big] \le H_n \le D_n \le \ln w_\mu^{-1} < \infty$

v) $\sqrt{\xi_t/\mu_t} \to 1$ i.m.s. and $\xi_t/\mu_t \to 1$ w.p.1 for $t\to\infty$

vi) $b_t - d_t \le a_t \le \sqrt{2d_t}$, $\quad B_n - D_n \le A_n \le \sqrt{2nD_n}$

where $\mu_t$, $\xi_t$ are defined in (6), $d_t$, $D_n$ are the relative entropies (7), and $w_\mu$ is the weight (1) of $\mu$ in $\xi$.

Proof. The inequality in (ii) follows from the definitions (7) and from the entropy inequality $s\le d$ (4). From the definition and finiteness of $D_\infty$ (8) and from $d_t(x_{<t})\ge 0$ one sees that $\sqrt{d_t(x_{<t})}\to 0$ i.m.s. for $t\to\infty$, which implies $d_t(x_{<t})\to 0$ w.p.1. The (first) inequality in (i) follows from (ii) by taking the $E$ expectation and the $\sum_{t=1}^n$ sum. (iii) follows from (i) by dropping $\sum_{x'_t}$. (iv) and (v) are related to (i) and (iii), but are incomparable convergence results. (iv) is proven as follows:

$$E_t\big[\big(\sqrt{\xi_t/\mu_t}-1\big)^2\big] = \sum_{x_t}\mu_t\big(\sqrt{\xi_t/\mu_t}-1\big)^2 = \sum_{x_t}\big(\sqrt{\xi_t}-\sqrt{\mu_t}\big)^2 = h_t(x_{<t}) \le d_t(x_{<t}). \tag{9}$$

The equalities and the inequality follow from (7) and $h\le d$ (4). (iv) now follows by taking the $E$ expectation and the $\sum_{t=1}^n$ sum. (v) follows from (iv) by the definition of convergence i.m.s., which implies convergence w.p.1. The first two inequalities in (vi) immediately follow from inequalities (4) and definitions (7). The third inequality of (vi) follows from the first by linearity of $E$ and $\sum$. The last inequality follows from

$$\tfrac{1}{n}A_n \equiv \tfrac{1}{n}\sum_{t=1}^n E[a_t] \le \tfrac{1}{n}\sum_{t=1}^n E\big[\sqrt{2d_t}\big] \le \tfrac{1}{n}\sum_{t=1}^n \sqrt{E[2d_t]} \le \sqrt{\tfrac{1}{n}\sum_{t=1}^n E[2d_t]} = \sqrt{\tfrac{2}{n}D_n}, \tag{10}$$

where we have used Jensen's inequality for exchanging the averages ($\frac{1}{n}\sum_{t=1}^n$ and $E$) with the concave function $\sqrt{\cdot}$. □

Since the conditional probabilities are the basis of the prediction algorithms considered in the next section and $\xi'_t$ converges rapidly to $\mu'_t$, we expect a good prediction performance if we use $\xi$ as a guess of $\mu$. Performance measures are defined in the next section.

Without the use of the Hellinger distance, a somewhat weaker statement than (v) can be derived from (vi):

$$E\big|\ln\tfrac{\mu_t}{\xi_t}\big| = E[b_t] \le E[d_t] + E\big[\sqrt{2d_t}\big] \le E[d_t] + \sqrt{2E[d_t]} \to 0,$$

since $E[d_t]\to 0$. I.e. $E|\ln\frac{\mu_t}{\xi_t}|\to 0$, which implies $\xi_t/\mu_t\to 1$ i.p. The explicit appearance of $n$ in the last expression of (vi) prevents proving stronger convergence of $\xi_t/\mu_t$ w.p.1 from (vi). Similarly, [Bar00, Th.2] shows (in our notation) a convergence result in $L_1$-norm for the log-ratio of $\mu(x_{1:t})$ and $\xi(x_{1:t})$, which is also not strong enough to derive (v).

The elementary proof for (v) w.p.1 given here does not rely on the semi-martingale convergence theorem [Doo53, pp. 324-325], unlike the proof of Gács in [LV97, Th.5.2.2]. Furthermore, (iv) (and (i)) give a "rate" of convergence in the sense that the expected number of times $\xi_t$ can depart from $\mu_t$ by more than $\varepsilon$ in the sense of $|\sqrt{\xi_t/\mu_t}-1|>\varepsilon$ (or $|\xi'_t-\mu'_t|>\varepsilon$) is bounded by $\varepsilon^{-2}\ln w_\mu^{-1}$. Note also the subtle difference between (iii) and (v). If $x_{1:\infty}$ is a $\mu$-random sequence, and $x'_{1:\infty}$ is any (possibly constant and not necessarily $\mu$-random) sequence, then $\mu'_t-\xi'_t$ converges to zero, but no statement is possible for $\xi'_t/\mu'_t$, since $\liminf\mu'_t$ could be zero. On the other hand, if we stay on the $\mu$-random sequence ($x'_{1:\infty}=x_{1:\infty}$), (v) shows that $\xi_t/\mu_t\to 1$ (whether $\inf\mu_t$ tends to zero or not does not matter). Indeed, it is easy to see that $\xi(1|0_{<t})/\mu(1|0_{<t}) \propto t\to\infty$ diverges for $\mathcal{M}=\{\mu,\nu\}$, $\mu(1|x_{<t}) := \frac12 t^{-3}$ and $\nu(1|x_{<t}) := \frac12 t^{-2}$, although $0_{1:\infty}$ is $\mu$-random [Hut02a].

An interesting open question is whether $\xi$ converges to $\mu$ (in difference (iii) or ratio (v)) individually for all Martin-Löf (M.L.) random sequences. Convergence M.L. implies convergence w.p.1, but the converse may fail on a set of sequences with $\mu$-measure zero. A convergence M.L. result would be particularly interesting for $\mathcal{M}$ being the set of all enumerable semimeasures and $\xi$ being Solomonoff's universal prior. Vovk's interesting results [Vov87] are not strong enough to settle this point, and the proof given in [VL00a] is incomplete. See [Hut02a] for further discussion.

4 Loss Bounds



Setup. A prediction is very often the basis for some decision. The decision results in an action, which itself leads to some reward or loss. We assume that the action itself does not influence the environment. Let $\ell_{x_t y_t}\in\mathbb{R}$ be the received loss when acting $y_t\in\mathcal{Y}$, and $x_t\in\mathcal{X}$ is the actual outcome. In many cases the prediction of $x_t$ can be identified with, or already is, the action $y_t$; then $\mathcal{X}=\mathcal{Y}$. For convenience we name an action a prediction in the following, even if $\mathcal{X}\neq\mathcal{Y}$. The true probability of the next symbol being $x_t$, given $x_{<t}$, is $\mu(x_t|x_{<t})$. The expected loss when predicting $y_t$ is $E_t[\ell_{x_t y_t}]$. The goal is to minimize the expected loss. More generally we define the $\Lambda_\rho$ prediction scheme

$$y_t^{\Lambda_\rho} := \arg\min_{y_t\in\mathcal{Y}} \sum_{x_t} \rho(x_t|x_{<t})\,\ell_{x_t y_t} \tag{11}$$

which minimizes the $\rho$-expected loss.¹ As the true distribution is $\mu$, the actual $\mu$-expected loss when $\Lambda_\rho$ predicts the $t$-th symbol and the total $\mu$-expected loss in the first $n$ predictions are

$$l_t^{\Lambda_\rho}(x_{<t}) := E_t\big[\ell_{x_t y_t^{\Lambda_\rho}}\big], \qquad L_n^{\Lambda_\rho} := \sum_{t=1}^n E\big[l_t^{\Lambda_\rho}(x_{<t})\big]. \tag{12}$$

Let $\Lambda$ be any (causal) prediction scheme (deterministic or probabilistic does not matter) with no constraint at all, predicting any $y_t^\Lambda\in\mathcal{Y}$ with losses $l_t^\Lambda$ and $L_n^\Lambda$ defined as in (12). If $\mu$ is known, $\Lambda_\mu$ is obviously the best prediction scheme in the sense of achieving minimal expected loss

$$L_n^{\Lambda_\mu} \le L_n^\Lambda \quad\text{for any}\quad \Lambda. \tag{13}$$
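The prediction scheme (11) is a plain argmin over expected losses. A minimal sketch for finite alphabet and finite action set follows; the list-of-lists loss matrix and the function names are assumptions made for this illustration.

```python
def bayes_optimal_action(rho, loss):
    """Return the action y minimizing sum_x rho[x] * loss[x][y], as in (11).

    rho  -- predictive probabilities rho(x_t | x_<t), one entry per symbol x
    loss -- loss matrix with loss[x][y] in [0, 1]
    Ties are broken in favor of the smallest action index.
    """
    n_actions = len(loss[0])
    expected = [sum(rho[x] * loss[x][y] for x in range(len(rho)))
                for y in range(n_actions)]
    return min(range(n_actions), key=expected.__getitem__)


def expected_loss(mu, loss, action):
    """mu-expected instantaneous loss of a fixed action, as in (12)."""
    return sum(mu[x] * loss[x][action] for x in range(len(mu)))
```

With the true posterior `mu` and a mixture posterior `xi`, the instantaneous losses in (12) are `expected_loss(mu, loss, bayes_optimal_action(mu, loss))` for the informed scheme and `expected_loss(mu, loss, bayes_optimal_action(xi, loss))` for the mixture scheme; the former can never exceed the latter.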

We prove the following loss bound for the $\Lambda_\xi$ predictor based on mixture $\xi$:

Theorem 2 (Loss bound) Let there be sequences $x_1x_2...$ over a finite alphabet $\mathcal{X}$ drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols. A system taking action (or predicting) $y_t\in\mathcal{Y}$ given $x_{<t}$ receives loss $\ell_{x_t y_t}\in[0,1]$ if $x_t$ is the true $t$-th symbol of the sequence. The $\Lambda_\rho$-system (11) acts (or predicts) so as to minimize the $\rho$-expected loss. $\Lambda_\xi$ is the prediction scheme based on the mixture $\xi$. $\Lambda_\mu$ is the optimal informed prediction scheme. The total $\mu$-expected losses $L_n^{\Lambda_\xi}$ of $\Lambda_\xi$ and $L_n^{\Lambda_\mu}$ of $\Lambda_\mu$ as defined in (12) are bounded in the following way

$$0 \le L_n^{\Lambda_\xi} - L_n^{\Lambda_\mu} \le D_n + \sqrt{4L_n^{\Lambda_\mu}D_n + D_n^2} \le 2D_n + 2\sqrt{L_n^{\Lambda_\mu}D_n}$$

where the relative entropy $D_n$ is bounded by $\ln w_\mu^{-1} < \infty$.

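The implications of Theorem 2 can best be read off from the following corollary.

Corollary 3 (Loss bound) Under the same conditions as in Theorem 2 the following relations hold:

i) $L_\infty^{\Lambda_\xi}$ is finite $\iff$ $L_\infty^{\Lambda_\mu}$ is finite,

ii) $L_\infty^{\Lambda_\xi} \le 2D_\infty \le 2\ln w_\mu^{-1}$ for deterministic $\mu$ if $\forall x\,\exists y: \ell_{xy}=0$,

iii) $L_n^{\Lambda_\xi}/L_n^{\Lambda_\mu} = 1 + O\big((L_n^{\Lambda_\mu})^{-1/2}\big) \to 1$ for $L_n^{\Lambda_\mu}\to\infty$,

iv) $L_n^{\Lambda_\xi} - L_n^{\Lambda_\mu} = O\big(\sqrt{L_n^{\Lambda_\mu}}\big)$.

Let $\Lambda$ be any prediction scheme.

v) $L_n^{\Lambda_\mu} \le L_n^\Lambda$,

vi) $L_n^\Lambda \ge L_n^{\Lambda_\xi} - 2\sqrt{L_n^{\Lambda_\xi}D_n} \ge L_n^{\Lambda_\xi} - O\big(\sqrt{L_n^{\Lambda_\xi}}\big)$,

vii) $L_n^{\Lambda_\xi}/L_n^\Lambda \le 1 + O\big((L_n^\Lambda)^{-1/2}\big)$.

¹ $\arg\min_y(\cdot)$ is defined as the $y$ which minimizes the argument. A tie is broken arbitrarily. If $\mathcal{Y}$ is finite, then $y_t^{\Lambda_\rho}$ always exists. For infinite action space $\mathcal{Y}$ we assume that a minimizing $y_t^{\Lambda_\rho}\in\mathcal{Y}$ exists, although even this assumption may be removed.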



The Corollary is a trivial consequence of Theorem 2 and (13). (vi) follows from Theorem 2 by replacing $L_n^{\Lambda_\mu}$ with $L_n^\Lambda$ and solving the quadratic inequality w.r.t. $L_n^\Lambda$. The main message is that the total loss $L_\infty^{\Lambda_\xi}$ of the mixture $\Lambda_\xi$ predictor is finite if the total loss $L_\infty^{\Lambda_\mu}$ of the informed $\Lambda_\mu$ predictor is finite, and that $L_n^{\Lambda_\xi}/L_n^{\Lambda_\mu}\to 1$ if $L_\infty^{\Lambda_\mu}$ is not finite. (vi) shows that no (causal) predictor $\Lambda$ whatsoever achieves significantly less (expected) loss than $\Lambda_\xi$. Worst case bounds for aggregating strategies, especially the one derived in [CB97], explicitly depend on the comparison class. There are always predictors which perform significantly better than the aggregating strategy. On the other hand these algorithms have the remarkable property that the bounds hold for any sequence, whereas our bounds only hold in an expected sense and depend on the environment $\mu\in\mathcal{M}$. See [Hut01b] for a more detailed discussion of the bounds in general and this duality in particular.
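The bounds of Theorem 2 can be evaluated exactly on a toy problem. The sketch below assumes the 0/1 error loss on a binary alphabet and a hypothetical class of three i.i.d. Bernoulli sources; it computes $L_n^{\Lambda_\xi}$, $L_n^{\Lambda_\mu}$ and $D_n$ by enumerating all histories. All names and parameter values are illustrative assumptions, not taken from the original.

```python
import math
from itertools import product

THETAS = [0.2, 0.5, 0.8]      # hypothetical Bernoulli hypotheses
WEIGHTS = [0.25, 0.5, 0.25]   # prior weights w_nu
MU = 0.8                      # true parameter, assumed to be in the class


def prob(theta, x):
    """Probability of the bit string x under an i.i.d. Bernoulli(theta) source."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def xi(x):
    """Bayes mixture of the three Bernoulli sources."""
    return sum(w * prob(th, x) for w, th in zip(WEIGHTS, THETAS))


def totals(n):
    """Return (L_xi, L_mu, D_n) for the 0/1 error loss, by exact enumeration."""
    loss_xi = loss_mu = d_n = 0.0
    for t in range(1, n + 1):
        for hist in product((0, 1), repeat=t - 1):
            p_hist = prob(MU, hist)                 # mu-probability of the history
            xi1 = xi(hist + (1,)) / xi(hist)        # xi(1 | history)
            y_xi = 1 if xi1 > 0.5 else 0            # mixture prediction
            y_mu = 1 if MU > 0.5 else 0             # informed prediction
            # The mu-expected error of predicting y is mu(x_t != y | history).
            loss_xi += p_hist * (1.0 - MU if y_xi == 1 else MU)
            loss_mu += p_hist * (1.0 - MU if y_mu == 1 else MU)
            d_n += p_hist * (MU * math.log(MU / xi1)
                             + (1.0 - MU) * math.log((1.0 - MU) / (1.0 - xi1)))
    return loss_xi, loss_mu, d_n
```

For each `n`, the regret `loss_xi - loss_mu` returned by `totals(n)` can be compared against `d_n + math.sqrt(4 * loss_mu * d_n + d_n ** 2)` and against the looser `2 * d_n + 2 * math.sqrt(loss_mu * d_n)`.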

Loss Bound of Merhav & Feder. The first general loss bound with no structural assumptions on $\mu$ and $\ell$ (except boundedness) has been derived in a survey paper by Merhav and Feder in [MF98, Sec.3.1.2]. (The special case of the error-loss has earlier been considered in [BCH93].) They showed that the regret $L_n^{\Lambda_\xi}-L_n^{\Lambda_\mu}$ is bounded by $\ell_{max}\sqrt{2nD_n}$ for $\ell\in[0,\ell_{max}]$. Assuming $\ell_{max}=1$ (general $\ell_{max}$ can be recovered by scaling) their bound reads (in our notation)

$$L_n^{\Lambda_\xi} - L_n^{\Lambda_\mu} \le A_n \le \sqrt{2nD_n}. \tag{14}$$

In Section 6 we prove $l_t^{\Lambda_\xi}(x_{<t}) - l_t^{\Lambda_\mu}(x_{<t}) \le a_t(x_{<t}) \le \sqrt{2d_t(x_{<t})}$. Taking the expectation $E$ and the average $\frac{1}{n}\sum_{t=1}^n$ and using Theorem 1(vi) shows (14).

Bound (14) and our bound (Theorem 2) are in general incomparable. Since $2D_\infty$ is finite and $L_n^{\Lambda_\mu}\le n$, bound (14) can be at best a factor $\sqrt2$ and an additive constant better than our bound. On the other hand, for large $n$ and for $L_n^{\Lambda_\mu}<\frac{n}{2}$ our bound is tighter. The latter condition is satisfied if the best predictor $\Lambda_\mu$ suffers small instantaneous loss $<\frac12$ on average. Significant improvement occurs if $L_n^{\Lambda_\mu}$ does not grow linearly with $n$, but is for instance finite (see Corollary 3, especially (i) and (ii)).
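The comparison between bound (14) and the last bound of Theorem 2 is simple arithmetic. The sketch below evaluates both for given values of $n$, $L_n^{\Lambda_\mu}$ and $D_n$; the function names and the numbers in the usage note are illustrative assumptions.

```python
import math


def merhav_feder_bound(n, d_n):
    """Right-hand side of (14): sqrt(2 n D_n), for losses in [0, 1]."""
    return math.sqrt(2.0 * n * d_n)


def theorem2_bound(l_mu, d_n):
    """Last bound of Theorem 2: 2 D_n + 2 sqrt(L_mu D_n)."""
    return 2.0 * d_n + 2.0 * math.sqrt(l_mu * d_n)
```

For instance, with `d_n = math.log(4)` and `n = 10000`, choosing `l_mu = 0.2 * n` (average instantaneous loss below 1/2) makes `theorem2_bound` smaller than `merhav_feder_bound`, while the worst case `l_mu = n` makes it larger by at most a factor $\sqrt2$ plus the additive constant `2 * d_n`.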

Example loss functions. The case $\mathcal{X}=\mathcal{Y}$ with unit error assignment $\ell_{xy}=1-\delta_{xy}$ ($\delta_{xy}=1$ for $x=y$ and $\delta_{xy}=0$ for $x\neq y$) has already been discussed and proven in [Hut01a]. In this case $L_n^{\Lambda_\rho}=E_n^{\Lambda_\rho}$ is the total expected number of prediction errors. For $\mathcal{X}=\mathcal{Y}=\{0,1\}$, $\Lambda_\rho$ is a threshold strategy with $y_t^{\Lambda_\rho} = \arg\min_{y\in\{0,1\}}\{\rho_1\ell_{1y}+\rho_0\ell_{0y}\} = 0/1$ for $\rho_1 \lessgtr \gamma$, where $\gamma := \frac{\ell_{01}-\ell_{00}}{\ell_{01}-\ell_{00}+\ell_{10}-\ell_{11}}$ and $\rho_i = \rho(i|x_{<t})$. In the special error case $\ell_{xy}=1-\delta_{xy}$, the bit with the highest $\rho$ probability is predicted ($\gamma=\frac12$).

In the following we consider some standard loss functions for binary outcome $\mathcal{X}=\{0,1\}$ and continuous action $y$ in the unit interval $\mathcal{Y}=[0,1]$. The absolute loss is defined as $\ell_{xy}=|x-y|\in[0,1]$. The $\Lambda_\rho$ scheme predicts $y_t^{\Lambda_\rho} = \arg\min_{y\in[0,1]}\{\rho_1(1-y)+\rho_0 y\} = 0/1$ for $\rho_0 \gtrless \rho_1$. Since all predictions $y$ lie in the subset $\{0,1\}\subset[0,1]$ and $|x-y|=1-\delta_{xy}$ for $y\in\{0,1\}$, this case coincides with the binary error case above. The same holds for the $\alpha$-loss $|x-y|^\alpha$ with $0<\alpha\le 1$. The $\mu$-expected loss is $l_t^{\Lambda_\rho} = 1-\mu(i|x_{<t})$ for the $i$ with $\rho_i>\frac12$.

For the quadratic loss $\ell_{xy}=(x-y)^2\in[0,1]$ the action/prediction $y_t^{\Lambda_\rho} = \arg\min_{y\in[0,1]}\{\rho_1(1-y)^2+\rho_0y^2\} = \rho_1$ is proportional to the $\rho$-probability of $x_t=1$ and $l_t^{\Lambda_\rho} = E_t(1-\rho(x_t|x_{<t}))^2$. For the $\alpha$-loss $|x-y|^\alpha$ with $\alpha>1$ we get $y_t^{\Lambda_\rho} = \big(1+\sqrt[\alpha-1]{\rho_0/\rho_1}\big)^{-1}$. For arbitrary finite alphabet $\mathcal{X}$ and vector-valued predictions $y$ the quadratic loss may be generalized to $\ell_{xy} = \frac12 y^TA_xy + b_x^Ty + c_x$.

The Hellinger loss can be written for binary outcome in the form $\ell_{xy} = 1-\sqrt{|1-x-y|}\in[0,1]$ with $y_t^{\Lambda_\rho} = \rho_1^2/(\rho_0^2+\rho_1^2)$ and $l_t^{\Lambda_\rho} = 1-(\mu_0\rho_0+\mu_1\rho_1)/\sqrt{\rho_0^2+\rho_1^2}$. The logarithmic loss $\ell_{xy} = -\ln|1-x-y|\in[0,\infty]$ is unbounded. But since the corresponding action is $y_t^{\Lambda_\rho}=\rho_1$, the expected loss is $l_t^{\Lambda_\rho} = -E_t\ln\rho(x_t|x_{<t})$. Hence $l_t^{\Lambda_\xi}-l_t^{\Lambda_\mu}=d_t$ and the total loss excess $L_n^{\Lambda_\xi}-L_n^{\Lambda_\mu}=D_n\le\ln w_\mu^{-1}$ is finitely bounded anyway and Theorem 2 is not needed.
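The closed-form optimal actions for the binary-outcome losses above can be compared against a brute-force grid search over $y\in(0,1)$. The sketch below is an illustration only; the loss names used as dictionary-style keys and the default $\alpha=3$ are assumptions of this example.

```python
import math


def loss(name, x, y, alpha=3.0):
    """Loss of action y in [0, 1] when the binary outcome is x."""
    r = abs(x - y)                      # note |1 - x - y| = 1 - r for x in {0, 1}
    if name == "absolute":
        return r
    if name == "quadratic":
        return r ** 2
    if name == "alpha":                 # alpha-loss with alpha > 1
        return r ** alpha
    if name == "hellinger":
        return 1.0 - math.sqrt(1.0 - r)
    if name == "log":
        return -math.log(1.0 - r)       # unbounded as r -> 1
    raise ValueError(name)


def closed_form_action(name, rho1, alpha=3.0):
    """Minimizer of the rho-expected loss, where rho1 = rho(1 | x_<t)."""
    rho0 = 1.0 - rho1
    if name == "absolute":
        return 1.0 if rho1 > rho0 else 0.0
    if name in ("quadratic", "log"):
        return rho1
    if name == "alpha":
        return 1.0 / (1.0 + (rho0 / rho1) ** (1.0 / (alpha - 1.0)))
    if name == "hellinger":
        return rho1 ** 2 / (rho0 ** 2 + rho1 ** 2)
    raise ValueError(name)


def expected_loss(name, rho1, y, alpha=3.0):
    """rho-expected loss of action y."""
    return rho1 * loss(name, 1, y, alpha) + (1.0 - rho1) * loss(name, 0, y, alpha)
```

For any `rho1` strictly between 0 and 1, `expected_loss(name, rho1, closed_form_action(name, rho1))` should not exceed the expected loss at any grid point `y`.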

5 Loss Bound Proof 

Main steps. The first inequality in Theorem 2 has already been proven (13). For the second and last inequality, we start looking for constants $A>0$ and $B>0$, which satisfy the linear inequality

$$L_n^{\Lambda_\xi} \le (A+1)L_n^{\Lambda_\mu} + (B+1)D_n. \tag{15}$$

If we could show

$$l_t^{\Lambda_\xi}(x_{<t}) \le A'\,l_t^{\Lambda_\mu}(x_{<t}) + B'\,d_t(x_{<t}) \tag{16}$$

with $A' := A+1$ and $B' := B+1$ for all $t\le n$ and all $x_{<t}$, (15) would follow immediately by summation and the definition of $L_n$ and $D_n$. With the abbreviations $m = y_t^{\Lambda_\mu}$ and $s = y_t^{\Lambda_\xi}$ and the abbreviations (5) and (6), the loss and entropy can then be expressed by $l_t^{\Lambda_\xi} = \sum_i y_i\ell_{is}$, $l_t^{\Lambda_\mu} = \sum_i y_i\ell_{im}$ and $d_t = \sum_i y_i\ln\frac{y_i}{z_i}$. Inserting this into (16) we get

$$\sum_{i=1}^N y_i\ell_{is} \le A'\sum_{i=1}^N y_i\ell_{im} + B'\sum_{i=1}^N y_i\ln\frac{y_i}{z_i}. \tag{17}$$

By definition (11) of $y_t^{\Lambda_\mu}$ and $y_t^{\Lambda_\xi}$ we have

$$\sum_i y_i\ell_{im} \le \sum_i y_i\ell_{ij} \quad\text{and}\quad \sum_i z_i\ell_{is} \le \sum_i z_i\ell_{ij} \tag{18}$$

for all $j$. Actually, we need the first constraint only for $j=s$ and the second for $j=m$. In the final paragraph of this section we reduce the problem to the binary $N=2$ case, which we will consider in the following. We take $\sum_{i=0}^1$ instead of $\sum_{i=1}^2$ for convenience.

$$B'\sum_{i=0}^1 y_i\ln\frac{y_i}{z_i} + \sum_{i=0}^1 y_i(A'\ell_{im}-\ell_{is}) \ge 0 \tag{19}$$

The cases $\ell_{im}>\ell_{is}\ \forall i$ and $\ell_{is}>\ell_{im}\ \forall i$ contradict the first/second inequality (18). Hence we can assume $\ell_{0m}\ge\ell_{0s}$ and $\ell_{1m}\le\ell_{1s}$. The symmetric case $\ell_{0m}\le\ell_{0s}$ and $\ell_{1m}\ge\ell_{1s}$ is proven analogously or can be reduced to the first case by renumbering the indices ($0\leftrightarrow 1$). Using the abbreviations $a := \ell_{0m}-\ell_{0s}$, $b := \ell_{1s}-\ell_{1m}$, $c := y_1\ell_{1m}+y_0\ell_{0s}$, $y = y_1 = 1-y_0$ and $z = z_1 = 1-z_0$ we can write (19) as

$$f(y,z) := B'\Big[y\ln\frac{y}{z} + (1-y)\ln\frac{1-y}{1-z}\Big] + A'(1-y)a - yb + Ac \ge 0 \tag{20}$$

for $zb\le(1-z)a$ and $0\le a,b,c,y,z\le 1$. The constraint (18) on $y$ has been dropped since (20) will turn out to be true for all $y$. Furthermore, we can assume that $d := A'(1-y)a - yb \le 0$ since for $d>0$, $f$ is trivially positive. Multiplying $d$ with a constant $\ge 1$ will decrease $f$. Let us first consider the case $z\le\frac12$. We multiply the $d$ term by $1/b\ge 1$, i.e. replace it with $A'(1-y)\frac{a}{b}-y$. From the constraint on $z$ we know that $\frac{a}{b}\ge\frac{z}{1-z}$. We can decrease $f$ further by replacing $\frac{a}{b}$ by $\frac{z}{1-z}$ and by dropping $Ac$. Hence, (20) is proven for $z\le\frac12$ if we can prove

$$f_1(y,z) := B'[...] + A'(1-y)\frac{z}{1-z} - y \ge 0 \quad\text{for}\quad z\le\tfrac12. \tag{21}$$

In the next paragraph of this section we prove that it holds for $B\ge\frac{1}{A}+1$. The case $z\ge\frac12$ is treated similarly. We scale $d$ with $1/a\ge 1$, i.e. replace it with $A'(1-y)-y\frac{b}{a}$. From the constraint on $z$ we know that $\frac{b}{a}\le\frac{1-z}{z}$. We decrease $f$ further by replacing $\frac{b}{a}$ by $\frac{1-z}{z}$ and by dropping $Ac$. Hence (20) is proven for $z\ge\frac12$ if we can prove

$$f_2(y,z) := B'[...] + A'(1-y) - y\frac{1-z}{z} \ge 0 \quad\text{for}\quad z\ge\tfrac12. \tag{22}$$

In the second next paragraph of this section we prove that it holds for $B\ge\frac{1}{A}+1$. So in summary we proved that (15) holds for $B\ge\frac{1}{A}+1$. Inserting $B=\frac{1}{A}+1$ into (15) and minimizing the r.h.s. w.r.t. $A$ leads to the last bound of Theorem 2 with $A=\sqrt{D_n/L_n^{\Lambda_\mu}}$. Actually inequalities (21) and (22) also hold for $B\ge\frac14A+\frac{1}{A}$, which, by the same minimization argument, proves the slightly tighter second bound in Theorem 2. Unfortunately, the current proof is very long and complex, and involves some numerical or graphical analysis for determining intersection properties of some higher order polynomials. This or a hopefully simplified proof will be postponed. The cautious reader may check the inequalities (21) and (22) numerically for $B=\frac14A+\frac{1}{A}$. □
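A numerical check of the kind suggested above can be sketched as follows. The code assumes the forms $f_1(y,z) = B'\,\mathrm{KL}(y\|z) + A'(1-y)\frac{z}{1-z} - y$ for $z\le\frac12$ and $f_2(y,z) = B'\,\mathrm{KL}(y\|z) + A'(1-y) - y\frac{1-z}{z}$ for $z\ge\frac12$, with $A'=A+1$, $B'=B+1$ and $\mathrm{KL}$ the binary relative entropy. It only evaluates the functions on a finite open grid, so it is a sanity check rather than a proof; the function names and the grid resolution are choices made for this illustration.

```python
import math


def kl(y, z):
    """Binary relative entropy y ln(y/z) + (1-y) ln((1-y)/(1-z)) for 0 < y, z < 1."""
    return y * math.log(y / z) + (1.0 - y) * math.log((1.0 - y) / (1.0 - z))


def f1(y, z, a, b):
    """Left-hand side of (21), with A' = a + 1 and B' = b + 1."""
    return (b + 1.0) * kl(y, z) + (a + 1.0) * (1.0 - y) * z / (1.0 - z) - y


def f2(y, z, a, b):
    """Left-hand side of (22), with A' = a + 1 and B' = b + 1."""
    return (b + 1.0) * kl(y, z) + (a + 1.0) * (1.0 - y) - y * (1.0 - z) / z


def grid_minimum(a, b, steps=200):
    """Smallest value of f1 (for z <= 1/2) and f2 (for z >= 1/2) on an open grid."""
    pts = [k / steps for k in range(1, steps)]
    low1 = min(f1(y, z, a, b) for y in pts for z in pts if z <= 0.5)
    low2 = min(f2(y, z, a, b) for y in pts for z in pts if z >= 0.5)
    return low1, low2
```

Calling `grid_minimum(a, 1 / a + 1)` probes the case $B=\frac1A+1$, and `grid_minimum(a, a / 4 + 1 / a)` probes the tighter choice $B=\frac14A+\frac1A$; non-negative minima on the grid are consistent with (21) and (22).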

Binary loss inequality for $z\le\frac12$ (21). We now prove $f_1(y,z)\ge 0$ for $z\le\frac12$ and suitable $A'=A+1\ge 1$ and $B'=B+1\ge 2$. We do this by showing that $f_1\ge 0$ at all extremal values and "at" boundaries. $f_1\to+\infty$ for $z\to 0$, if we choose $B'>0$. For the boundary $z=\frac12$ we lower bound the relative entropy by the sum over squares $s\le d$ (4):

$$f_1(y,\tfrac12) \ge 2B'(y-\tfrac12)^2 + A'(1-y) - y \ge 0 \quad\text{for}\quad B\ge\tfrac14A+\tfrac1A,$$

as can be shown by minimizing the r.h.s. w.r.t. $y$. Furthermore for $A\ge 4$ and $B\ge 1$ we have $f_1(y,\frac12)\ge 2(1-y)(3-2y)\ge 0$. Hence $f_1(y,\frac12)\ge 0$ for $B\ge\frac1A+1$, since for $A\ge 4$ it implies $B\ge 1$ and for $A\le 4$ it implies $B\ge\frac14A+\frac1A$. The extremal condition $\partial f_1/\partial z=0$ (keeping $y$ fixed) leads to

$$y = y^* := z\cdot\frac{B'(1-z)+A'}{B'(1-z)+A'z}.$$

Inserting $y^*$ into the definition of $f_1$ and, again, replacing the relative entropy by the sum over squares ($y\ln\frac{y}{z}+(1-y)\ln\frac{1-y}{1-z}\ge 2(y-z)^2$), which is a special case of $s\le d$ (4), we get

$$f_1(y^*,z) \ge 2B'(y^*-z)^2 + A'(1-y^*)\frac{z}{1-z} - y^* = \frac{z(1-z)\cdot g_1(z)}{[B'(1-z)+A'z]^2},$$

$$g_1(z) := 2B'A'^2z(1-z) + [(A'-1)B'(1-z)-A']\Big(B'+A'\frac{z}{1-z}\Big).$$

We have reduced the problem to showing $g_1\ge 0$. If the bracket $[...]$ is positive, then $g_1$ is positive. If the bracket is negative, we can decrease $g_1$ by increasing $\frac{z}{1-z}\le 1$ in $(B'+A'\frac{z}{1-z})$ to 1. The resulting expression is now quadratic in $z$ with minima at the boundary values $z=0$ and $z=\frac12$. It is therefore sufficient to check

$$g_1(0) \ge (AB-1)(A+B+2) \ge 0 \quad\text{and}\quad g_1(\tfrac12) \ge \tfrac12(AB-1)(2A+B+3) \ge 0,$$

which is true for $B\ge\frac1A$. In summary we have proved (21) for $B\ge\frac1A+1$ and $A>0$. □

Binary loss inequality for $z\ge\frac12$ (22). We now show $f_2(y,z)\ge 0$ for $z\ge\frac12$ and suitable $A'=A+1\ge 1$ and $B'=B+1\ge 2$, similarly as in the last paragraph, by proving that $f_2\ge 0$ at all extremal values and "at" boundaries. $f_2\to+\infty$ for $z\to 1$. The boundary $z=\frac12$ has already been checked in the last paragraph. The extremal condition $\partial f_2/\partial z=0$ (keeping $y$ fixed) leads to

$$y = y^* := z\cdot\frac{B'z}{(B'+1)z-1}.$$

Inserting $y^*$ into the definition of $f_2$ and replacing the relative entropy by the sum over squares $s\le d$ (4), we get

$$f_2(y^*,z) \ge 2B'(y^*-z)^2 + A'(1-y^*) - y^*\frac{1-z}{z} = \frac{z(1-z)\cdot g_2(z)}{[(B'+1)z-1]^2},$$

$$g_2(z) := [(A'-1)B'z - A' + 2z(1-z)]\Big(B'+1-\frac1z\Big) + 2(1-z)^2.$$

We have reduced the problem to showing $g_2\ge 0$. Since $(B'+1-\frac1z)\ge 0$ it is sufficient to show that the bracket is positive. We solve $[...]\ge 0$ w.r.t. $B$ and get

$$B \ge \frac{1-2z(1-z)}{z}\cdot\frac1A + \frac{1-z}{z}.$$

For $B\ge\frac1A+1$ this is satisfied for all $\frac12\le z\le 1$. In summary we have proved (22) for $B\ge\frac1A+1$ and $A>0$. □

General loss inequality (17). We reduce

$$f(y,z) := B'\sum_{i=1}^N y_i\ln\frac{y_i}{z_i} + A'\sum_{i=1}^N y_i\ell_{im} - \sum_{i=1}^N y_i\ell_{is} \ge 0 \tag{23}$$

$$\text{for}\quad \sum_{i=1}^N z_id_i \ge 0, \qquad d_i := \ell_{im}-\ell_{is} \tag{24}$$

to the binary $N=2$ case. We do this by keeping $y$ fixed and showing that $f$ as a function of $z$ is positive at all extrema in the interior of the simplex $\Delta := \{z: \sum_i z_i=1,\ z_i\ge 0\}$ of the domain of $z$ and "at" all boundaries. First, the boundaries $z_i\to 0$ are safe as $f\to\infty$ for $B'>0$. Variation of $f$ w.r.t. $z$ leads to a minimum at $z=y$. If $\sum_i y_id_i\ge 0$, we have

$$f(y,y) = \sum_i y_i(A'\ell_{im}-\ell_{is}) \ge \sum_i y_i(\ell_{im}-\ell_{is}) = \sum_i y_id_i \ge 0.$$

In the first inequality we used $A'\ge 1$. If $\sum_i y_id_i<0$, $z=y$ is outside the valid domain due to the constraint (24) and the valid minima are attained at the boundary $\Delta\cap P$, $P := \{z: \sum_i z_id_i=0\}$. We implement the constraints with the help of Lagrange multipliers and extremize

$$L(y,z) := f(y,z) + B'\lambda\sum_i z_i + B'\mu\sum_i z_id_i.$$

$\partial L/\partial z_i=0$ leads to $y_i = y_i^* := z_i(\lambda+\mu d_i)$. Summing this equation over $i$ we obtain $\lambda=1$. $\mu$ is a function of $y$ for which a formal expression might be given. If we eliminate $y_i$ in favor of $z_i$, we get

$$f(y^*,z) = \sum_i c_iz_i \quad\text{with}\quad c_i := (1+\mu d_i)\big(B'\ln(1+\mu d_i) + A'\ell_{im} - \ell_{is}\big).$$

In principle $\mu$ is a function of $y$, but we can treat $\mu$ directly as an independent variable, since $y$ has been eliminated.

The next step is to determine the extrema of the function $f=\sum_i c_iz_i$ for $z\in\Delta\cap P$. For clarity we state the line of reasoning for $N=3$. In this case $\Delta$ is a triangle. As $f$ is linear in $z$ it assumes its extrema at the vertices of the triangle, where all $z_i=0$ except one. But we have to take into account a further constraint $z\in P$. The plane $P$ intersects triangle $\Delta$ in a finite line (for $\Delta\cap P=\{\}$ the only boundaries are $z_i\to 0$, which have already been treated). Again, as $f$ is linear, it assumes its extrema at the ends of the line, i.e. at edges of the triangle $\Delta$ on which all but two $z_i$ are zero. With a similar line of arguments for $N>3$ we conclude that a necessary condition for a minimum of $f$ at the boundary is that at most two $z_i$ are non-zero. But this implies that all but two $y_i$ are zero. If we had eliminated $z$ in favor of $y$, we could not have made the analogous conclusion because $y_i=0$ does not necessarily imply $z_i=0$. We have effectively reduced the problem of showing $f(y^*,z)\ge 0$ to the case $N=2$. We can go back one step further and prove (23) for $N=2$, which implies $f(y^*,z)\ge 0$ for $N=2$. A proof of (23) for $N=2$ implies, by the arguments given above, that it holds for all $N$. This is what we set out to show here. □

The $N=2$ case has been proven in the previous paragraphs. This completes the proof of Theorem 2. □

6 Instantaneous Losses 

Since $L_n^{\Lambda_\xi}-L_n^{\Lambda_\mu}$ is not finitely bounded by Theorem 2, it cannot be used directly to conclude analogously that $l_t^{\Lambda_\xi}-l_t^{\Lambda_\mu}\to 0$. It would follow from $\xi_t\to\mu_t$ by continuity if $l_t^{\Lambda_\xi}$ and $l_t^{\Lambda_\mu}$ were continuous functions of $\xi_t$ and $\mu_t$. $l_t^{\Lambda_\mu}$ is a continuous piecewise linear concave function, but $l_t^{\Lambda_\xi}$ is, in general, a discontinuous function of $\xi_t$ (and $\mu_t$). Fortunately it is continuous at the one necessary point $\xi_t=\mu_t$. This allows us to bound $l_t^{\Lambda_\xi}-l_t^{\Lambda_\mu}$ in terms of $\xi_t-\mu_t$.

Theorem 4 (Instantaneous Loss Bound) Under the same conditions as in Theorem 2 for discrete $\mathcal{M}$ the following relations hold for the instantaneous losses $l_t^{\Lambda_\mu}(x_{<t})$ and $l_t^{\Lambda_\xi}(x_{<t})$ at time $t$ of the informed and mixture prediction schemes $\Lambda_\mu$ and $\Lambda_\xi$:

i) $\sum_{t=1}^n E\big[(l_t^{\Lambda_\xi}-l_t^{\Lambda_\mu})^2\big] \le 2D_n \le 2\ln w_\mu^{-1} < \infty$

ii) $0 \le l_t^{\Lambda_\xi}-l_t^{\Lambda_\mu} \le \sum_{x_t}|\xi_t-\mu_t| \le \sqrt{2d_t} \to 0$ w.p.1

iii) $0 \le l_t^{\Lambda_\xi}-l_t^{\Lambda_\mu} \le 2d_t + 2\sqrt{l_t^{\Lambda_\mu}d_t} \to 0$ w.p.1

Proof. (ii) follows from

$$l_t^{\Lambda_\xi}(x_{<t}) - l_t^{\Lambda_\mu}(x_{<t}) = \sum_i y_i(\ell_{is}-\ell_{im}) \le \sum_i (y_i-z_i)(\ell_{is}-\ell_{im}) \le \sum_i |y_i-z_i|\cdot|\ell_{is}-\ell_{im}| \le \sum_i |y_i-z_i| = a_t(x_{<t}) \le \sqrt{2d_t(x_{<t})}.$$

To arrive at the first inequality we added $\sum_i z_i(\ell_{im}-\ell_{is})$, which is positive due to (18). $|\ell_{is}-\ell_{im}|\le 1$ since $\ell\in[0,1]$. The last inequality follows from $a\le\sqrt{2d}$ (4). (i) follows by inserting (ii) and using (8). (iii) follows from the proof of Theorem 2 by inserting $B=\frac1A+1=\sqrt{l_t^{\Lambda_\mu}/d_t}+1$ into (16). Convergence to zero holds for $\mu$-random sequences, i.e. w.p.1, since $l_t^{\Lambda_\mu}\le 1$ is bounded. The losses $l_t^{\Lambda_\mu}(x_{<t})$ themselves need not converge. □

Note that the inequalities in (ii) and (iii) hold for all individual sequences. The sum/average is only taken over the current outcome $x_t$, but the history $x_{<t}$ is fixed. Bounds (ii) and (iii) are in general incomparable, but for large $t$ and for $l_t^{\Lambda_\mu}<\frac12$ (especially if $l_t^{\Lambda_\mu}\to 0$) bound (iii) is tighter than bound (ii).
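Because (ii) and (iii) hold for a fixed history, they can be checked for a single pair of posteriors. The sketch below assumes a finite alphabet, a finite action set, strictly positive posteriors and a loss matrix with entries in $[0,1]$; the function name and argument layout are hypothetical.

```python
import math


def instantaneous_terms(mu, xi, loss):
    """Return (l_xi, l_mu, a_t, d_t) for one time step with a fixed history.

    mu, xi -- true and mixture posteriors over the next symbol
    loss   -- loss matrix, loss[x][y] in [0, 1]
    """
    actions = range(len(loss[0]))

    def exp_loss(p, y):
        return sum(p[x] * loss[x][y] for x in range(len(p)))

    y_mu = min(actions, key=lambda y: exp_loss(mu, y))   # informed action
    y_xi = min(actions, key=lambda y: exp_loss(xi, y))   # mixture action
    l_mu = exp_loss(mu, y_mu)                            # both losses are mu-expected
    l_xi = exp_loss(mu, y_xi)
    a_t = sum(abs(m - q) for m, q in zip(mu, xi))
    d_t = sum(m * math.log(m / q) for m, q in zip(mu, xi) if m > 0)
    return l_xi, l_mu, a_t, d_t
```

With the returned values, (ii) reads `0 <= l_xi - l_mu <= a_t <= math.sqrt(2 * d_t)` and (iii) reads `l_xi - l_mu <= 2 * d_t + 2 * math.sqrt(l_mu * d_t)`.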

7 Conclusions 

Generalization. The only assumptions we made in this work were that $\mu\in\mathcal{M}$, the loss $\ell$ is bounded to $[0,1]$, and that the decision $y_t$ does not influence the environment, i.e. $\mu$ is independent of $y_t$. No other structural assumptions on $\mathcal{M}$ and $\ell$ have been made. The case $\mu\notin\mathcal{M}$ is briefly discussed in [Hut02a] and more intensively in [Grü98] in a related context. Simple scaling allows loss functions in arbitrary bounded intervals [Hut01b]. Asymptotic loss/value bounds for an acting agent influencing the environment can be found in [Hut02b].

Optimality properties. In [Hut02a] we show that there are $\mathcal{M}$ and $\mu\in\mathcal{M}$ and weights $w_\nu$ such that the derived loss bounds are tight. This shows that the loss bounds cannot be improved in general, i.e. without making extra assumptions on $\ell$, $\mathcal{M}$, or $w_\nu$. We also show Pareto-optimality of $\xi$ in the sense that there is no other predictor which performs better or equal in all environments $\nu\in\mathcal{M}$ and strictly better in at least one. Optimal predictors (in a decision theoretic sense) can always be based on a mixture distribution $\xi$. This still leaves open how to choose the weights. We give an Occam's razor argument that the choice $w_\nu\sim 2^{-K(\nu)}$, where $K(\nu)$ is the length of the shortest program describing $\nu$, is optimal.

Outlook. The presented theorems and proofs are independent of the size of $\mathcal{X}$ and hence should generalize to countably infinite and continuous alphabets under (minor) technical conditions. An infinite prediction space $\mathcal{Y}$ was no problem at all as long as we assumed the existence of $y_t^{\Lambda_\rho}\in\mathcal{Y}$ (11), but even this is not essential. The $\Lambda_\rho$ schemes and theorems may be generalized to delayed sequence prediction, where the true symbol $x_t$ is given only in cycle $t+d$. Another direction is to investigate the learning aspect of mixture prediction. Many prediction schemes explicitly learn and exploit a model of the environment. Learning and exploitation are merged in the framework of universal Bayesian prediction. A separation of these two aspects in the spirit of hypothesis learning with MDL [VL00b] could lead to new insights. A unified picture of the loss bounds obtained here and the loss bounds for predictors based on expert advice (PEA) could also be fruitful. Also, bounds which say that the actual (not expected) loss suffered by $\Lambda_\xi$ divided by the loss suffered by $\Lambda_\mu$ is with high probability close to 1 for sufficiently large $n$ would be interesting. Maximum-likelihood predictors may also be studied. See [Hut02a] for further references and discussions on the relation between Bayes and PEA approaches and results, classification tasks, games of chance, infinite alphabets, continuous classes $\mathcal{M}$, universal mixtures, and others.

Summary. We compared mixture predictions based on Bayes-mixes $\xi$ to the infeasible informed predictor based on the unknown true generating distribution $\mu$. Convergence results of the mixture posterior $\xi_t$ to the true posterior $\mu_t$ have been derived. A new (elementary) derivation of the convergence in ratio has been presented, including a rate of convergence. The main focus was on a decision-theoretic setting, where each prediction $y_t\in\mathcal{X}$ (or more generally action $y_t\in\mathcal{Y}$) results in a loss $\ell_{x_t y_t}$ if $x_t$ is the true next symbol of the sequence. We have shown that the $\Lambda_\xi$ predictor suffers only slightly more loss than the $\Lambda_\mu$ predictor, improving on various previous results.